[KV Connector] Canonical KV Cache Allocation for HMA Models#37885
Etelis wants to merge 29 commits into vllm-project:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Force-pushed aa5d414 to 9697d1c
Code Review
This pull request introduces canonical KV cache allocation for Hybrid Multi-Attention (HMA) models, specifically targeting those with uniform page sizes. This is a significant improvement as it enables contiguous cross-layer block allocation, which was previously limited to single-group models. The changes involve new data structures to represent canonical KV caches and their references, along with modifications to the KV cache allocation logic within the gpu_model_runner and kv_connector_model_runner_mixin. A comprehensive unit test suite has been added to validate the new allocation strategy under various conditions, including happy paths and rejection cases. The implementation appears well-considered and robust, addressing the stated goal of improving RDMA transfer efficiency by ensuring memory contiguity for HMA models.
Hi @Etelis, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
# The connector must support HMA
if not supports_hma(get_kv_transfer_group()):
    return False
if len(kv_cache_config.kv_cache_groups) <= 1:
    return False

Suggested change:
- if len(kv_cache_config.kv_cache_groups) <= 1:
+ if len(kv_cache_config.kv_cache_groups) < 1:
# All groups must use AttentionSpec with uniform page size

Suggested change:
- # All groups must use AttentionSpec with uniform page size
+ # Currently, all groups must use AttentionSpec with uniform page size
+ # We plan to gradually relax this requirement to support other cases
spec = kv_cache_config.kv_cache_groups[0].kv_cache_spec
assert isinstance(spec, AttentionSpec)

Can we remove this and use the spec inside the loop for each group?
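A minimal sketch of what the reviewer is suggesting: rather than reading the spec from group 0 and asserting its type once, validate every group's spec inside the loop. The names (`AttentionSpec`, `kv_cache_spec`) mirror the PR, but the classes and the helper function here are simplified stand-ins, not the actual vLLM types.

```python
from dataclasses import dataclass


@dataclass
class AttentionSpec:
    page_size_bytes: int


@dataclass
class KVCacheGroup:
    kv_cache_spec: object


def all_groups_uniform(groups: list[KVCacheGroup]) -> bool:
    page_sizes = set()
    for group in groups:
        spec = group.kv_cache_spec
        # Reject non-attention specs for every group, not just group 0.
        if not isinstance(spec, AttentionSpec):
            return False
        page_sizes.add(spec.page_size_bytes)
    # All groups must share a single page size.
    return len(page_sizes) == 1


groups = [KVCacheGroup(AttentionSpec(4096)), KVCacheGroup(AttentionSpec(4096))]
print(all_groups_uniform(groups))
```

This removes the single-group assert entirely: a non-attention spec in any group simply disables the canonical path instead of crashing.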
)
self.cross_layers_kv_cache = cross_layers_kv_cache
self.cross_layers_attn_backend = attn_backend
elif self.use_canonical_kv_caches(

Let's move this check before checking use_uniform_kv_cache.
kernel_num_blocks = num_blocks * num_blocks_per_kv_block

# prepend a group_size dimension into the shape
kv_cache_shape = attn_backend.get_kv_cache_shape(

Can we move this logic AFTER we allocate the single tensor? Then, inside the layer loop, we can reshape? I think we can also remove assert len(unique_kernel_bs) == 1. I think it's better to also build the group_data_refs inside the same loop.
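The allocate-then-reshape idea above can be sketched as follows: allocate one flat buffer up front, then take a per-layer strided view inside the layer loop. numpy stands in for torch here, and the shapes are toy values, not the PR's actual layout.

```python
import numpy as np

num_blocks, group_size, page_elems = 4, 2, 8

# Single contiguous allocation covering all layers in the group.
flat = np.zeros(num_blocks * group_size * page_elems, dtype=np.float32)

layer_views = []
for layer_idx in range(group_size):
    # Each layer gets a strided view into the same buffer -- no copy.
    view = flat.reshape(num_blocks, group_size, page_elems)[:, layer_idx, :]
    layer_views.append(view)

# Writing through a view mutates the shared buffer, showing that all
# layers of a block live inside one contiguous allocation.
layer_views[1][0, :] = 1.0
print(flat[page_elems:2 * page_elems])
```

Because reshaping happens per layer after the single allocation, there is no need for a uniform-kernel-block-size assert before allocating, which matches the reviewer's point.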
@property
def needs_kv_cache_zeroing(self) -> bool:
    return self.has_mamba_layers

These classes are currently specific to connector usage. I think we should move them to base.py.
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
…nector Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Force-pushed db277a8 to 5b2b3bc
WorkerConnectorInitializationData,
)

kv_transfer_group.initialize_worker_connector(

Actually, initialize_worker_connector is needed for the CacheBlend use-case. Let's try to call it exactly as in #37339. But keep this if here and simply pass, commenting that the canonical KV caches will be registered below.
I thought they'd add it themselves afterwards; nvm, I will fix it.
Combine the kv_caches population, block tensor splitting, and layer-to-position mapping into a single pass over positions. Remove the unique kernel block size assertion. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Call initialize_worker_connector unconditionally so connectors like CacheBlend can use it regardless of the allocation path taken. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Force-pushed 30cca9f to 432d002
canonical_kv_caches is the CanonicalKVCaches wrapper for the connector.
"""
# all tensors have the same size (validated by use_canonical_kv_caches)

Where did we validate this?
Move the uniform tensor size check into use_canonical_kv_caches so the precondition is validated before entering the allocation path, keeping the assert in allocate_canonical_kv_caches as a safety net. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
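The commit's guard-plus-safety-net approach can be illustrated with a toy sketch. Both function names mirror the PR, but the bodies are hypothetical simplifications that only model the uniform-size precondition:

```python
def use_canonical_kv_caches(tensor_sizes: list[int]) -> bool:
    # Guard: reject configs whose KV cache tensors differ in size,
    # so the allocation path never sees a non-uniform config.
    return len(set(tensor_sizes)) == 1


def allocate_canonical_kv_caches(tensor_sizes: list[int]) -> int:
    # Safety net: the guard above should already have filtered this.
    assert len(set(tensor_sizes)) == 1, "tensor sizes must be uniform"
    return tensor_sizes[0]


sizes = [1 << 20] * 3
if use_canonical_kv_caches(sizes):
    print(allocate_canonical_kv_caches(sizes))
```

The design point is that the boolean guard decides the allocation path, while the assert only documents (and enforces in debug) an invariant the guard already established.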
orozery
left a comment
Thanks @Etelis !
Can you please test this PR on top of this branch?
https://github.com/orozery/vllm/tree/kv-offload-hma
Specifically, verify test_cpu_offloading.py passes, and whether we see performance gains.
[] for _ in kv_cache_config.kv_cache_groups
]

kernel_block_size = kernel_block_sizes[0]

Can we initialize kernel_block_size = kernel_block_sizes[gid] inside the loop?

Done, I didn't consider that backends could have different kernel block sizes.
block_tensor = typed_buffer.select(group_dim, i)
tensor_idx = len(block_tensors)
page_bytes = block_tensor[0].numel() * block_tensor.element_size()
block_tensors.append(

Aren't we expecting a single cross-layers tensor? With shape (num_blocks, page_size) and dtype int8?

Yeah, that's dumb. Fixed.
Replace per-position KVCacheBlockTensor objects with a single (num_blocks, cross_layer_page_size) int8 tensor. This avoids recomputing block tensors per position and matches the pattern used by the offloading connector's register_cross_layers_kv_cache. Also use per-group kernel_block_sizes[gid] inside the loop instead of hardcoded kernel_block_sizes[0]. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
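The fixed representation described in the commit can be sketched like this: one byte-typed view of the whole cross-layers buffer, shaped (num_blocks, cross_layer_page_size), instead of one block tensor per position. numpy stands in for torch, and all sizes are illustrative.

```python
import numpy as np

num_blocks, group_size, page_bytes = 4, 3, 16

# The single cross-layers allocation, here modeled as a raw byte buffer.
raw = np.zeros(num_blocks * group_size * page_bytes, dtype=np.uint8)

# One int8 view covering every layer's page within each block.
cross_layer_page_size = group_size * page_bytes
blocks = raw.view(np.int8).reshape(num_blocks, cross_layer_page_size)
print(blocks.shape)  # (4, 48)
```

Since `blocks` is a view, no per-position tensors are materialized; a connector can address block i as `blocks[i]` directly.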
Running on your branch I hit some issues with the connector not implementing the ... Ran on top of that branch with an A100.
Running other models as well, so I'll update soon.
for layer_name in kv_cache_tensor.shared_by:
    layer_gid = layer_to_group_idx[layer_name]
    group_data_refs[layer_gid].append(
        KVCacheBlockDataRef(
            tensor_idx=0,
            page_size_bytes=page_size,
        )
    )

We should have a single data reference per group.
Drop the duplicate KVCacheBlockTensor / KVCacheBlockDataRef / CanonicalKVCaches dataclasses from kv_connector/v1/base.py and import the existing types from vllm.v1.kv_offload.spec. Emit a single data reference per group instead of one per layer. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
…ge size Restore KVCacheBlockTensor / KVCacheBlockDataRef / CanonicalKVCaches in kv_connector/v1/base.py (these types are connector-owned) and fix KVCacheBlockDataRef.page_size_bytes to cover all layers in the group (page_size * group_size) now that we emit a single ref per group. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
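The one-ref-per-group fix described in the commit can be sketched as below. The `KVCacheBlockDataRef` fields mirror the PR; the builder function is a hypothetical simplification showing the page_size * group_size accounting:

```python
from dataclasses import dataclass


@dataclass
class KVCacheBlockDataRef:
    tensor_idx: int
    page_size_bytes: int


def build_group_data_refs(page_size: int, group_sizes: list[int]):
    # One ref per group, not one per layer: page_size_bytes covers
    # every layer in the group (page_size * group_size).
    return [
        [KVCacheBlockDataRef(tensor_idx=0, page_size_bytes=page_size * gs)]
        for gs in group_sizes
    ]


refs = build_group_data_refs(page_size=4096, group_sizes=[8, 4])
print([r[0].page_size_bytes for r in refs])  # [32768, 16384]
```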
- use_canonical_kv_caches: < 2 -> < 1 to allow single-group HMA models - allocate_canonical_kv_caches: read num_blocks from config and assert - allocate_canonical_kv_caches: drop unreachable try/except around get_kv_cache_stride_order (guard already validates it succeeds) Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
orozery
left a comment
LGTM. Thanks @Etelis !
@NickLucche @heheda12345 WDYT?
This PR extends the cross-layers layout to models with multiple groups that are all attention (e.g. gpt-oss).
More importantly, it defines a generic API (CanonicalKVCaches) for describing the KV caches (either cross-layers or not) to the connector.
It is meant to replace register_cross_layers_kv_cache, which is kept for now for backward compatibility.
This API could support (without any extension) models using mamba or hybrid mamba/attention. (We plan that to be a follow-up to this PR.)
Also, this API can later be extended to include striding information for connectors doing hetero-TP transfers.
@Etelis Now that the offloading connector supports HMA, we should add its
…HMA Models Applied squashed diff from vllm-project#37885 Original author: Itay Etelis <itay.etelis@ibm.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Can review tomorrow, apologies for the delay.

Tested performance with gpt-oss-20b on H100:
NickLucche
left a comment
I guess one main question: why aren't we folding this into the cross-layer functionality as a natural extension to HMA?
It feels like things are tied in the current formulation (canonical implies cross-layer), and yet there are separate branches in shared code paths like the runner.
class CanonicalKVCaches:
    """
    Canonicalized block-level representation of the KV caches.

    Composed of:
    - Unique list of KV cache data tensors,
      each with shape (num_blocks, page_size_in_bytes) and int8 dtype.
    - Per-group data references of the tensors,
      i.e. how each KV cache group maps to the tensors.
    """

    # Ordered list of unique block tensors, each with shape
    # (num_blocks, ...).
    tensors: list[KVCacheBlockTensor]
    # Per-KV-cache-group list of data references that map each layer
    # in the group to the appropriate entry in the tensors list.
    group_data_refs: list[list[KVCacheBlockDataRef]]
I am not sure these dataclasses about tensors belong here with the kv_connector interface. They look a lot more related to what's in kv_cache_manager.py.
I'd rather keep this file lean for the actual interface.
Connectors need a way to know how to access the KV cache tensors.
Currently, connectors have 2 tasks:
- Determine the topology for each KV cache tensor
- Determine how each group maps to each KV cache tensor (using KVCacheConfig)
Using the canonical KV caches saves connectors these 2 tasks:
- All tensors are (num_blocks, ...) first
- group_data_refs describes how each group maps to tensors.
With the cross-layers layout you cannot use KVCacheConfig, as the tensors (a single one) do not match kv_cache_config.kv_cache_tensors.
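The two tasks the canonical API removes can be made concrete with a small sketch. The dataclass fields mirror the PR's CanonicalKVCaches; the lookup helper and the toy tensor are illustrative only:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class KVCacheBlockDataRef:
    tensor_idx: int
    page_size_bytes: int


@dataclass
class CanonicalKVCaches:
    tensors: list          # each entry: (num_blocks, ...) with int8 dtype
    group_data_refs: list  # per-group refs into `tensors`


def block_bytes(caches: CanonicalKVCaches, group_id: int, block_id: int):
    # The connector never inspects KVCacheConfig: the group's ref says
    # which tensor backs it and how many bytes one block occupies.
    ref = caches.group_data_refs[group_id][0]
    tensor = caches.tensors[ref.tensor_idx]
    return tensor[block_id, : ref.page_size_bytes]


tensor = np.arange(4 * 8, dtype=np.int8).reshape(4, 8)
caches = CanonicalKVCaches(
    tensors=[tensor],
    group_data_refs=[[KVCacheBlockDataRef(tensor_idx=0, page_size_bytes=8)]],
)
print(block_bytes(caches, group_id=0, block_id=2))
```

Because every tensor is (num_blocks, ...)-first, this one lookup works identically for cross-layers and per-layer layouts.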
if len(page_sizes) != 1:
    return False

Isn't this case unexpected if we take UniformTypeKVCacheSpecs out of the equation? We should probably assert or at least log.

This is the use_canonical_kv_caches function. The purpose of this function is to determine whether we should allocate cross-layers or use the regular allocation. If the spec is UniformTypeKVCacheSpecs, we should return False to disable cross-layers allocation (for now).
# num_blocks must be the leading physical dimension.
# +1 accounts for the prepended group_size dimension.
if stride_order[0] != kv_cache_shape.index(1234) + 1:
    return False

Can we unify to use get_kv_cache_block_dim to get the block dim?

We already get the kv_cache_shape to check if cross-layers is supported, so it seems easier and more efficient to leave this check here.
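The check quoted above relies on a dummy-value trick: ask the backend for a KV cache shape using a distinctive num_blocks value (1234), then locate that value in the returned shape to find the block dimension. A minimal illustration, where the shape function is a made-up stand-in for a backend's get_kv_cache_shape:

```python
def get_kv_cache_shape(num_blocks, block_size, num_heads, head_size):
    # A FlashAttention-style layout is assumed here for illustration:
    # (2, num_blocks, block_size, num_heads, head_size)
    return (2, num_blocks, block_size, num_heads, head_size)


# Probe with a distinctive value, then search for it in the shape.
kv_cache_shape = get_kv_cache_shape(1234, 16, 8, 64)
block_dim = kv_cache_shape.index(1234)
print(block_dim)  # 1: num_blocks is the second physical dimension
```

The trick only works if no other dimension can equal the probe value, which is why a distinctive number like 1234 is used rather than a small one.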
if len(stride_order) != len(kv_cache_shape) + 1:
    return False

I feel like some of these invariants could be asserted by use_uniform_kv_cache per group?

use_uniform_kv_cache should actually be deprecated, along with register_cross_layers_kv_cache. I suggest that we remove it. Connectors using prefers_cross_layers_block will simply fall back to the regular register_kv_caches. What do you think?
This is the first phase of a multi-phase effort to enable contiguous KV cache allocation for all model architectures. Currently, only single-group (uniform) models benefit from contiguous cross-layer blocks. This PR extends that to HMA models with uniform page sizes. Future phases will broaden support to models with varying page sizes and additional architectures.
The existing allocate_uniform_kv_caches path only supports single-group models (all layers identical). HMA models like Gemma 3 have multiple KV cache groups (full attention + sliding window) with different eviction policies but the same page size. Previously, these models fell back to per-layer allocation, which scatters block data across non-contiguous memory regions, making RDMA transfers inefficient. This PR extends contiguous KV cache allocation to HMA models where all KV cache groups share the same page size.
Test plan
- pytest tests/v1/kv_connector/unit/test_canonical_kv_caches.py -v -s (exercises use_canonical_kv_caches)
Related PRs
- KVCacheTopology PR (closed, too complex).
- WorkerConnectorInitializationData pattern. We adopt their interface design; hopefully to be merged after that PR.